
Introduction

The task of this project was to analyze data on the Portuguese “Vinho Verde” wine. The data set includes 11 variables describing the chemical composition of the wines and an output variable, quality, which is derived from the ratings of at least 3 wine experts. The analysis in this project attempts to determine the relationship between the chemical contents of a wine and its quality rating.

Overview of the Data

Size of the Wine Data Sets

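The white wine data set has 4898 entries. The original data contain 11 chemical variables plus the quality rating; the dimensions shown below are those of the data set used for this report, which also includes the additional columns described in the note that follows.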
## [1] 4898   18
The red wine data set has 1599 rows, with the same variables as the white wine data set.
## [1] 1599   18
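NOTE: Four additional variables/features (combined.acidity, s.a.ratio, taste and taste.due.to.pH) were created to represent the data better. These are discussed in the analysis section. The remaining extra columns are the row index X and the wine type.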

Overview of the Variables

The names and data types of the variables in the two data sets are shown below. An additional variable, type, was defined in both data sets to differentiate between the two types of wine. This variable is useful for combined and comparative analysis of the two data sets.

White wine overview

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "type"                 "combined.acidity"    
## [16] "s.a.ratio"            "taste"                "taste.due.to.pH"
## 'data.frame':    4898 obs. of  18 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ type                : Ord.factor w/ 1 level "White": 1 1 1 1 1 1 1 1 1 1 ...
##  $ combined.acidity    : num  7.63 6.94 8.78 7.75 7.75 8.78 6.68 7.63 6.94 8.75 ...
##  $ s.a.ratio           : num  2.713 0.231 0.786 1.097 1.097 ...
##  $ taste               : Ord.factor w/ 4 levels "Dry"<"Medium_Dry"<..: 3 1 1 2 2 1 2 3 1 1 ...
##  $ taste.due.to.pH     : Ord.factor w/ 4 levels "Dry"<"Medium_Dry"<..: 3 2 1 2 2 1 2 3 2 1 ...

Red wine overview

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "type"                 "combined.acidity"    
## [16] "s.a.ratio"            "taste"                "taste.due.to.pH"
## 'data.frame':    1599 obs. of  18 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ type                : Ord.factor w/ 1 level "Red": 1 1 1 1 1 1 1 1 1 1 ...
##  $ combined.acidity    : num  8.1 8.68 8.6 12.04 8.1 ...
##  $ s.a.ratio           : num  0.235 0.3 0.267 0.158 0.235 ...
##  $ taste               : Ord.factor w/ 2 levels "Dry"<"Medium_Dry": 1 1 1 1 1 1 1 1 1 1 ...
##  $ taste.due.to.pH     : Ord.factor w/ 3 levels "Dry"<"Medium_Dry"<..: 3 1 1 1 3 3 2 2 2 2 ...

Overview of the Combined Data Set

Combining the two data sets can help reveal interesting insights into the chemical composition of wines and the resulting quality variable.
## [1] 6497   18
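
A minimal sketch of how the two data sets could be tagged with a wine type and then combined. The tiny data frames below are stand-ins with a few values copied from the overviews above; the object names `white`, `red` and `wine` are illustrative assumptions, not necessarily the names used in this report.

```r
# Toy stand-ins for the white and red wine data (values from the overviews above).
white <- data.frame(fixed.acidity = c(7.0, 6.3),
                    alcohol       = c(8.8, 9.5),
                    quality       = c(6L, 6L))
red   <- data.frame(fixed.acidity = c(7.4, 7.8),
                    alcohol       = c(9.4, 9.8),
                    quality       = c(5L, 5L))

# Tag each data set with its wine type before stacking the rows.
white$type <- "White"
red$type   <- "Red"

# rbind() needs matching column names, which both data sets share.
wine <- rbind(white, red)
wine$type <- factor(wine$type, levels = c("White", "Red"))

dim(wine)         # rows = white rows + red rows
table(wine$type)  # number of samples per type
```

Applied to the full data, the same steps would be expected to give 4898 + 1599 = 6497 rows, which matches the dimensions shown above.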

Quality of the Wine Samples (The only categorical variable in the original data)

The quality of each wine sample was rated by at least 3 wine experts on a scale from 0 (very bad) to 10 (excellent). In these particular data sets, wine quality ranges between 3 and 9.
## [1] 3 4 5 6 7 8 9

Summary of the Data Set

White Wine

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality         type      combined.acidity
##  Min.   : 8.00   Min.   :3.000   White:4898   Min.   : 4.130  
##  1st Qu.: 9.50   1st Qu.:5.000                1st Qu.: 6.890  
##  Median :10.40   Median :6.000                Median : 7.405  
##  Mean   :10.51   Mean   :5.878                Mean   : 7.467  
##  3rd Qu.:11.40   3rd Qu.:6.000                3rd Qu.: 7.960  
##  Max.   :14.20   Max.   :9.000                Max.   :14.960  
##    s.a.ratio                taste          taste.due.to.pH
##  Min.   :0.06459   Dry         :3053   Dry         :2286  
##  1st Qu.:0.23495   Medium_Dry  :1591   Medium_Dry  :1985  
##  Median :0.72251   Medium_Sweet: 253   Medium_Sweet: 586  
##  Mean   :0.85776   Sweet       :   1   Sweet       :  41  
##  3rd Qu.:1.28738                                          
##  Max.   :7.02616
Summary: Most of the white wine samples in this data set have a quality of around 6, with the mean at 5.9. On the 0 to 10 scale, the experts therefore rated these samples slightly above the midpoint on average. The alcohol percentage for most of the samples is around 10-11%. The mean value of total SO2 is around 138 mg/dm^3, which might cause a slight smell in the nose and taste of the wine. The pH of most of the samples is close to 3.2, which is on the acidic side of the pH scale. The additional variables combined.acidity and s.a.ratio were created and then used to build categorical variables that describe the taste of the wine samples. Most white wines fall under the categories of dry to medium dry, and only a small share of the samples are on the sweeter side.

Red Wine

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality       type      combined.acidity
##  Min.   : 8.40   Min.   :3.000   Red:1599   Min.   : 5.270  
##  1st Qu.: 9.50   1st Qu.:5.000              1st Qu.: 7.827  
##  Median :10.20   Median :6.000              Median : 8.720  
##  Mean   :10.42   Mean   :5.636              Mean   : 9.118  
##  3rd Qu.:11.10   3rd Qu.:6.000              3rd Qu.:10.070  
##  Max.   :14.90   Max.   :8.000              Max.   :17.045  
##    s.a.ratio             taste          taste.due.to.pH
##  Min.   :0.1053   Dry       :1580   Dry         :717   
##  1st Qu.:0.2117   Medium_Dry:  19   Medium_Dry  :691   
##  Median :0.2482                     Medium_Sweet:191   
##  Mean   :0.2854                                        
##  3rd Qu.:0.3008                                        
##  Max.   :2.0807
Summary: Red wine samples in this data set have a median quality of 6, with the mean a little lower at 5.6. As with white wine, the samples were on average rated slightly above the midpoint of the 0 to 10 scale. The alcohol percentage for most of the samples is around 10-10.5%, similar to that of the white wine samples. The mean value of total SO2 is around 46.5 mg/dm^3, which is much lower than what we observed for white wine. The pH of most of the samples is close to 3.3. The additional variables created for taste show that red wine samples are mostly on the dry side.

Univariate Plots Section

This section explores the variables in the data set through univariate charts and plots.
NOTE: For histograms, white bars are used for white wine data and red bars for red wine data.

Distribution of Alcohol content of wine samples

The alcohol percentage distribution of the wine samples is multimodal, with a major peak at around 9.5% for both types of wine and a smaller peak at 11%. The log10-transformed plot shows a similar distribution. The frequency polygon shows that for most of the samples the alcohol percentage is between 9% and 13%, with multiple peaks.

pH variation in the dataset

The pH values for the samples appear roughly symmetrically distributed, with a mean of around 3.2 for white and 3.3 for red wine. The log transformation also shows a symmetric distribution of pH. Analysis along with other variables in the bivariate and multivariate sections might reveal more interesting relationships between pH and other variables in the dataset.

Sulphate (SO4) distribution of the samples

The distribution of sulphates is slightly positively skewed for both wine types, while the log-transformed plot is closer to symmetric.

Distribution of Sulphur Dioxide (SO2)

NOTE: To make the histogram below easier to read, I have limited the x axis to 150 mg/dm^3, which excludes the only outlier for white wine at 289 mg/dm^3. The count below confirms that only one sample exceeds the limit.
sum(wqw$free.sulfur.dioxide > 150)
## [1] 1

Free Sulphur Dioxide

Free sulphur dioxide for white wine samples is roughly symmetrically distributed, while for red wine samples it is positively skewed. Most samples of both wine types have free SO2 below 100 mg/dm^3. The red wine distribution, being heavily skewed, also shows slight bimodality in the transformed plot.
NOTE: Again, for easier reading the x axis is limited, this time to 300 mg/dm^3, which excludes 6 outliers for white wine. The count below double-checks this.

sum(wqw$total.sulfur.dioxide > 300)

## [1] 6

Total Sulphur Dioxide

Total sulphur dioxide, being a combination of free and bound SO2, is mostly undetected in wine. Again, the red wine samples are positively skewed while the white wine samples show an approximately bell-shaped distribution. The transformed plot for red wine shows a wide spread with multimodality, just as we saw for free SO2.

Distribution of Acidic content in the Wine samples

Total Acidic Content

The total acidic content of the two wine types is clearly different. For white wine, all the acid types follow an approximately normal distribution, while for red wine fixed.acidity shows slight multimodality and volatile.acidity and citric acid show more pronounced multimodality.
## [1] 132
There are 132 samples of red wine which have no citric acid.
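
A count like this can be obtained by summing a logical comparison. The sketch below uses only the first ten red wine citric acid values listed in the red wine overview, so its result applies to those ten values, not to the full data set; the variable names are illustrative.

```r
# First ten red wine citric acid values, as listed in the red wine overview.
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Comparing against 0 gives a logical vector; sum() counts the TRUE values.
n.zero <- sum(citric.acid == 0)
n.zero
```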

Combined Acidity

The combined acidity plots show a very slight positive skewness for red wine samples while white wine samples follow an approximate normal distribution. This created feature will help determine the taste of the wine samples.

Sugar content in the Wine samples

The histogram for residual sugar shows that most of the samples have a sugar level between 1 and 3 g/dm^3. The distribution for white wine samples is positively skewed, and the log10-transformed plot shows multimodality with peaks at around 0.2, 0.9 and 1.2 on the log10 scale. The red wine samples, on the other hand, appear more evenly distributed, with slight positive skewness.

The detailed histogram for residual sugar also shows that most of the samples have a sugar level between 1 and 3 g/dm^3. The plot also shows the long tail to the positive side for white wine samples, which is in accord with the multimodality we observed in the transformed plot.
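
To illustrate the log10 transformation used for the skewed variables, here is a small base R sketch. It bins the first ten white wine residual sugar values from the overview without drawing a plot; the bin edges are an assumption chosen for this example, and base R is used only to keep the sketch free of package dependencies.

```r
# First ten white wine residual sugar values (g/dm^3) from the overview above.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the long right tail so that separate clusters become visible.
log.sugar <- log10(residual.sugar)

# Bin the transformed values without drawing; the bin edges are an assumption.
h <- hist(log.sugar, breaks = seq(0, 1.5, by = 0.25), plot = FALSE)

# One row per bin: midpoint on the log10 scale and number of samples.
data.frame(bin.mid = h$mids, count = h$counts)
```

Empty bins between occupied ones are what appears as separate peaks in a transformed histogram.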

Sugar to acid ratio of Wine samples

The sugar to acid ratio plots are heavily skewed to the right. For white wine samples, the transformed plot shows bimodality with two significant peaks, while red wine samples are more evenly distributed and follow an approximately normal distribution. Being derived from residual sugar, the sugar to acid ratio shows similar characteristics.

Taste of Wine samples

The overall taste distribution (variable based on the information at this url: http://drinkriesling.com/tastescale/thescale) is on the dry side. Summary for taste variation in both the data sets is shown below.

Taste of White Wine

##          Dry   Medium_Dry Medium_Sweet        Sweet 
##         3053         1591          253            1

Taste of Red Wine

##        Dry Medium_Dry 
##       1580         19
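
The report does not state the cut points used to bucket the sugar to acid ratio into taste categories. The sketch below shows one way such an ordered factor could be built with `cut()`; the boundaries 1, 2 and 4 are assumptions made for illustration only, and the four ratios are the first white wine values from the overview.

```r
# Sugar-to-acid ratios of the first four white wine samples (from the overview).
s.a.ratio <- c(2.713, 0.231, 0.786, 1.097)

# ASSUMED cut points (1, 2 and 4); the report does not state the ones it used.
taste <- cut(s.a.ratio,
             breaks = c(0, 1, 2, 4, Inf),
             labels = c("Dry", "Medium_Dry", "Medium_Sweet", "Sweet"),
             ordered_result = TRUE)

table(taste)
```

For these four samples the assumed boundaries give the same taste codes as the white wine overview (3 1 1 2), but that does not confirm the boundaries used for the full data.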

Taste of Wine samples due to pH

The taste distribution derived from pH (variable based on the information at this url: http://drinkriesling.com/tastescale/thescale) shifts part of the samples up one bucket, from Dry to Medium Dry. A summary of the taste variation due to pH in both data sets is shown below.

Taste of White Wine due to pH

##          Dry   Medium_Dry Medium_Sweet        Sweet 
##         2286         1985          586           41

Taste of Red Wine due to pH

##          Dry   Medium_Dry Medium_Sweet 
##          717          691          191

Saltiness in Wine samples

The plots show that most samples of both types lie between 0 and 0.1 g/dm^3, with a few outliers having salt content above 0.3 g/dm^3. The log10-transformed plot shows a peak at around -1.25 for white wine samples and around -1.2 for red wine samples.
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 485   485           6.2            0.370        0.30            6.6
## 1218 1218           8.0            0.610        0.38           12.1
## 4916   18           8.1            0.560        0.28            1.7
## 4918   20           7.9            0.320        0.51            1.8
## 4941   43           7.5            0.490        0.20            2.6
## 4980   82           7.8            0.430        0.70            1.9
## 4982   84           7.3            0.670        0.26            1.8
## 5005  107           7.8            0.410        0.68            1.7
## 5050  152           9.2            0.520        1.00            3.4
## 5068  170           7.5            0.705        0.24            1.8
## 5125  227           8.9            0.590        0.50            2.0
## 5157  259           7.7            0.410        0.76            1.8
## 5180  282           7.7            0.270        0.68            3.5
## 5190  292          11.0            0.200        0.48            2.0
## 5350  452           8.4            0.370        0.53            1.8
## 5591  693           8.6            0.490        0.51            2.0
## 5629  731           9.5            0.550        0.66            2.3
## 5653  755           7.8            0.480        0.68            1.7
## 5950 1052           8.5            0.460        0.59            1.4
## 6064 1166           8.5            0.440        0.50            1.9
## 6159 1261           8.6            0.635        0.68            1.8
## 6218 1320           9.1            0.760        0.68            1.7
## 6269 1371           8.7            0.780        0.51            1.7
## 6271 1373           8.7            0.780        0.51            1.7
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 485      0.346                  79                  200 0.99540 3.29
## 1218     0.301                  24                  220 0.99930 2.94
## 4916     0.368                  16                   56 0.99680 3.11
## 4918     0.341                  17                   56 0.99690 3.04
## 4941     0.332                   8                   14 0.99680 3.21
## 4980     0.464                  22                   67 0.99740 3.13
## 4982     0.401                  16                   51 0.99690 3.16
## 5005     0.467                  18                   69 0.99730 3.08
## 5050     0.610                  32                   69 0.99960 2.74
## 5068     0.360                  15                   63 0.99640 3.00
## 5125     0.337                  27                   81 0.99640 3.04
## 5157     0.611                   8                   45 0.99680 3.06
## 5180     0.358                   5                   10 0.99720 3.25
## 5190     0.343                   6                   18 0.99790 3.30
## 5350     0.413                   9                   26 0.99790 3.06
## 5591     0.422                  16                   62 0.99790 3.03
## 5629     0.387                  12                   37 0.99820 3.17
## 5653     0.415                  14                   32 0.99656 3.09
## 5950     0.414                  16                   45 0.99702 3.03
## 6064     0.369                  15                   38 0.99634 3.01
## 6159     0.403                  19                   56 0.99632 3.02
## 6218     0.414                  18                   64 0.99652 2.90
## 6269     0.415                  12                   66 0.99623 3.00
## 6271     0.415                  12                   66 0.99623 3.00
##      sulphates alcohol quality  type combined.acidity s.a.ratio      taste
## 485       0.58     9.6       5 White            6.870 0.9606987        Dry
## 1218      0.48     9.2       5 White            8.990 1.3459399 Medium_Dry
## 4916      1.28     9.3       5   Red            8.940 0.1901566        Dry
## 4918      1.08     9.2       6   Red            8.730 0.2061856        Dry
## 4941      0.90    10.5       6   Red            8.190 0.3174603        Dry
## 4980      1.28     9.4       5   Red            8.930 0.2127660        Dry
## 4982      1.14     9.4       5   Red            8.230 0.2187120        Dry
## 5005      1.31     9.3       5   Red            8.890 0.1912261        Dry
## 5050      2.00     9.4       4   Red           10.720 0.3171642        Dry
## 5068      1.59     9.5       5   Red            8.445 0.2131439        Dry
## 5125      1.61     9.5       6   Red            9.990 0.2002002        Dry
## 5157      1.26     9.4       5   Red            8.870 0.2029312        Dry
## 5180      1.08     9.9       7   Red            8.650 0.4046243        Dry
## 5190      0.71    10.5       5   Red           11.680 0.1712329        Dry
## 5350      1.06     9.1       6   Red            9.300 0.1935484        Dry
## 5591      1.17     9.0       5   Red            9.600 0.2083333        Dry
## 5629      0.67     9.6       5   Red           10.710 0.2147526        Dry
## 5653      1.06     9.1       6   Red            8.960 0.1897321        Dry
## 5950      1.34     9.2       5   Red            9.550 0.1465969        Dry
## 6064      1.10     9.4       5   Red            9.440 0.2012712        Dry
## 6159      1.15     9.3       5   Red            9.915 0.1815431        Dry
## 6218      1.33     9.1       6   Red           10.540 0.1612903        Dry
## 6269      1.17     9.2       5   Red            9.990 0.1701702        Dry
## 6271      1.17     9.2       5   Red            9.990 0.1701702        Dry
##      taste.due.to.pH
## 485              Dry
## 1218      Medium_Dry
## 4916             Dry
## 4918             Dry
## 4941             Dry
## 4980             Dry
## 4982             Dry
## 5005             Dry
## 5050             Dry
## 5068             Dry
## 5125             Dry
## 5157             Dry
## 5180             Dry
## 5190      Medium_Dry
## 5350             Dry
## 5591             Dry
## 5629             Dry
## 5653             Dry
## 5950             Dry
## 6064             Dry
## 6159             Dry
## 6218             Dry
## 6269             Dry
## 6271             Dry
The table above is the subset of the combined data set with salt content (chlorides) greater than 0.3 g/dm^3. Only 2 of these 24 outliers are white wine samples.
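
A subset like this can be extracted with a simple row filter. In the sketch below the data frame `wine` is a toy stand-in: the first two rows use the chloride values of samples 485 and 1218 from the table above, and the third is an ordinary white wine sample from the overview. The names `wine` and `salty` are illustrative assumptions.

```r
# Toy rows: two chloride outliers from the table above and one ordinary sample.
wine <- data.frame(X         = c(485L, 1218L, 1L),
                   chlorides = c(0.346, 0.301, 0.045),
                   type      = c("White", "White", "White"))

# Keep only the rows whose chloride content exceeds 0.3 g/dm^3.
salty <- subset(wine, chlorides > 0.3)
salty
```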

The detailed plot shows an approximately normal distribution in the untransformed histogram. The major peak is at 0.045 g/dm^3 for white wine and at 0.085 g/dm^3 for red wine.

Density of the Wine samples

NOTE: The x axis of the histogram is limited so that the distribution of density can be viewed with more clarity.

Density for both wine types lies primarily between 0.98 and 1 g/cm^3, with both distributions close to normal.

Quality of Wine samples

Quality, the output variable and the only categorical variable in the original data, is most often 6 for white wine samples, while 5 is the most common rating for red wine samples.

Univariate Analysis

What is the structure of your dataset?

The dataset consists of 6497 observations of wines, of which 4898 are for white wine and the remaining 1599 for red wine. The original data set has 12 variables, one of which is the output variable (quality). Quality is based on sensory data provided by at least 3 wine experts and is scored between 0 (poor) and 10 (excellent).
Useful observations at a glance:
- The most common quality rating is 6; the mean rating is about 5.9 for white wine and 5.6 for red wine.
- The mean alcohol content in both types of wine is around 10.5%.
- Mean total SO2 is 138.4 for white and 46.47 for red, while the maximum total SO2 values are 440 and 289 respectively, suggesting that the maxima are outliers.
- Similar behaviour can be observed for the free SO2 content, as the median values are far below the maximum values.
- Residual sugar for both wine types also shows anomalies, with maximum values extremely high compared to the 3rd quartile values.

What is/are the main feature(s) of interest in your dataset?

In my opinion, the main features of the wine data sets are quality and alcohol. Other features that might play an important role in determining the quality of a wine sample are residual sugar and the acidic contents (fixed, volatile and citric).
Correlation results for Quality and Alcohol are shown below:

White Wine

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$alcohol and wqw$quality
## t = 33.8585, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

Red Wine

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$alcohol and wqr$quality
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
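
Output of this form comes from `cor.test()`, as the `data:` lines above indicate. The sketch below runs the same kind of test on only the first ten red wine samples from the overview, so its numbers will differ from the full-data results shown above.

```r
# First ten red wine samples from the overview (alcohol in %, quality score).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson correlation test; "pearson" is also the default method of cor.test().
res <- cor.test(alcohol, quality, method = "pearson")

res$estimate   # sample correlation coefficient
res$parameter  # degrees of freedom, n - 2
res$conf.int   # 95 percent confidence interval
```

The degrees of freedom reported above (4896 and 1597) are likewise the sample sizes minus 2.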

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Additional features that might help in determining the quality of wine are pH, density and sulphate content (which acts as an antioxidant).

Did you create any new variables from existing variables in the dataset?

The following new variables were created using the existing variables:
- The acidity variables were summed to form ‘combined.acidity’, i.e. the sum of fixed.acidity, volatile.acidity and citric.acid. This new variable was used as an input for another new variable.
combined.acidity = fixed.acidity + volatile.acidity + citric.acid
- A variable for the ratio of sugar to acid (s.a.ratio) in the wine samples was also created to measure the balance between dryness and sweetness.
s.a.ratio = residual.sugar / combined.acidity
- Two ordered categorical variables, ‘taste’ and ‘taste.due.to.pH’, were created to describe the taste of the samples, as summarized in the univariate plots section.
- A variable for ‘type’ was created when combining the White and Red wine data sets to distinguish between the two.
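
The two formulas above can be checked against the overview output. This sketch rebuilds both derived variables for the first three white wine rows; the data frame name `wqw` follows the name visible in the correlation output, and the input values are copied from the white wine overview.

```r
# First three white wine rows, copied from the overview above.
wqw <- data.frame(fixed.acidity    = c(7.0, 6.3, 8.1),
                  volatile.acidity = c(0.27, 0.30, 0.28),
                  citric.acid      = c(0.36, 0.34, 0.40),
                  residual.sugar   = c(20.7, 1.6, 6.9))

# combined.acidity is the sum of the three acidity measurements.
wqw$combined.acidity <- with(wqw, fixed.acidity + volatile.acidity + citric.acid)

# s.a.ratio divides residual sugar by the combined acidity.
wqw$s.a.ratio <- wqw$residual.sugar / wqw$combined.acidity

wqw[, c("combined.acidity", "s.a.ratio")]
```

The results agree with the first values listed for combined.acidity (7.63, 6.94, 8.78) and s.a.ratio (2.713, 0.231, 0.786) in the white wine overview.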

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The histogram for alcohol showed multimodality when first created and was hard to interpret. A frequency polygon was created so that the data look tidier and are easier to interpret. Since alcohol is an important feature in the data set, it helps that the frequency polygon shows the variations clearly.
Distributions for residual sugar, free sulphur dioxide and sugar to acid ratio were transformed to view modality. The histograms of all of these variables were positively skewed when initially plotted, and the log10 transformation revealed multimodality. The transformed plot of residual sugar for white wine samples shows peaks at around 0.2, 0.9 and 1.2 on the log10 scale. The transformed plot of free sulphur dioxide for red wine samples shows multimodality with no significant peak. The sugar to acid ratio histogram for white wine showed a large number of samples with a ratio below 0.25 and, when transformed, showed multimodality with peaks at around -0.7, 0.1 and 0.3 on the log10 scale.
For ease of viewing, some of the plots were adjusted by tweaking the scales. The histograms for chlorides and residual sugar were adjusted by changing the scales and binwidth to view the distributions in more detail. The adjusted residual sugar plot for white wine revealed the wide spread of the samples, which was confirmed by the transformed plot. The adjusted chloride plot gave a finer-grained view that was easier to interpret.

Bivariate Plots Section

We will start off by tabulating and plotting the original variables from the wine data sets to look for interesting relationships.

Correlation between Wine data set variables

Correlation table of White Wine

##               Fixed Acid Volatile Acid Citric Acid Sugar  Salt Free SO2
## Fixed Acid          1.00         -0.02        0.29  0.09  0.02    -0.05
## Volatile Acid      -0.02          1.00       -0.15  0.06  0.07    -0.10
## Citric Acid         0.29         -0.15        1.00  0.09  0.11     0.09
## Sugar               0.09          0.06        0.09  1.00  0.09     0.30
## Salt                0.02          0.07        0.11  0.09  1.00     0.10
## Free SO2           -0.05         -0.10        0.09  0.30  0.10     1.00
## Total SO2           0.09          0.09        0.12  0.40  0.20     0.62
## Density             0.27          0.03        0.15  0.84  0.26     0.29
## pH                 -0.43         -0.03       -0.16 -0.19 -0.09     0.00
## SO4                -0.02         -0.04        0.06 -0.03  0.02     0.06
## Alcohol            -0.12          0.07       -0.08 -0.45 -0.36    -0.25
## Quality            -0.11         -0.19       -0.01 -0.10 -0.21     0.01
##               Total SO2 Density    pH   SO4 Alcohol Quality
## Fixed Acid         0.09    0.27 -0.43 -0.02   -0.12   -0.11
## Volatile Acid      0.09    0.03 -0.03 -0.04    0.07   -0.19
## Citric Acid        0.12    0.15 -0.16  0.06   -0.08   -0.01
## Sugar              0.40    0.84 -0.19 -0.03   -0.45   -0.10
## Salt               0.20    0.26 -0.09  0.02   -0.36   -0.21
## Free SO2           0.62    0.29  0.00  0.06   -0.25    0.01
## Total SO2          1.00    0.53  0.00  0.13   -0.45   -0.17
## Density            0.53    1.00 -0.09  0.07   -0.78   -0.31
## pH                 0.00   -0.09  1.00  0.16    0.12    0.10
## SO4                0.13    0.07  0.16  1.00   -0.02    0.05
## Alcohol           -0.45   -0.78  0.12 -0.02    1.00    0.44
## Quality           -0.17   -0.31  0.10  0.05    0.44    1.00

Correlation plot of White Wine

Summary: Density, SO2, sugar, alcohol and acid content for white wine show correlation with each other. Density also shows a weak correlation with salt (0.26). The main features of our data set, alcohol and quality, have a moderate correlation of 0.44. Density and sugar have the highest correlation, 0.84, followed in magnitude by density and alcohol at -0.78. These relationships will be explored in the later sections.
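
A correlation table of this kind can be sketched with `cor()` and `round()`. The example uses three variables and only the first five white wine rows from the overview (the overview truncates density after five values), so the coefficients it prints are not those of the full table above.

```r
# First five white wine rows from the overview (density is listed for five rows).
wqw <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pairwise Pearson correlations, rounded to two decimals as in the table above.
r <- round(cor(wqw, method = "pearson"), 2)
r
```

The matrix is symmetric with ones on the diagonal, which is why each value in the tables above appears twice.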

Correlation table of Red Wine

##               Fixed Acid Volatile Acid Citric Acid Sugar  Salt Free SO2
## Fixed Acid          1.00         -0.26        0.67  0.11  0.09    -0.15
## Volatile Acid      -0.26          1.00       -0.55  0.00  0.06    -0.01
## Citric Acid         0.67         -0.55        1.00  0.14  0.20    -0.06
## Sugar               0.11          0.00        0.14  1.00  0.06     0.19
## Salt                0.09          0.06        0.20  0.06  1.00     0.01
## Free SO2           -0.15         -0.01       -0.06  0.19  0.01     1.00
## Total SO2          -0.11          0.08        0.04  0.20  0.05     0.67
## Density             0.67          0.02        0.36  0.36  0.20    -0.02
## pH                 -0.68          0.23       -0.54 -0.09 -0.27     0.07
## SO4                 0.18         -0.26        0.31  0.01  0.37     0.05
## Alcohol            -0.06         -0.20        0.11  0.04 -0.22    -0.07
## Quality             0.12         -0.39        0.23  0.01 -0.13    -0.05
##               Total SO2 Density    pH   SO4 Alcohol Quality
## Fixed Acid        -0.11    0.67 -0.68  0.18   -0.06    0.12
## Volatile Acid      0.08    0.02  0.23 -0.26   -0.20   -0.39
## Citric Acid        0.04    0.36 -0.54  0.31    0.11    0.23
## Sugar              0.20    0.36 -0.09  0.01    0.04    0.01
## Salt               0.05    0.20 -0.27  0.37   -0.22   -0.13
## Free SO2           0.67   -0.02  0.07  0.05   -0.07   -0.05
## Total SO2          1.00    0.07 -0.07  0.04   -0.21   -0.19
## Density            0.07    1.00 -0.34  0.15   -0.50   -0.17
## pH                -0.07   -0.34  1.00 -0.20    0.21   -0.06
## SO4                0.04    0.15 -0.20  1.00    0.09    0.25
## Alcohol           -0.21   -0.50  0.21  0.09    1.00    0.48
## Quality           -0.19   -0.17 -0.06  0.25    0.48    1.00

Correlation plot of Red Wine

## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

Summary: For red wine, density, SO2, sugar, salt and acid content show stronger correlations with each other. Alcohol and Quality also have a moderate positive correlation of 0.48. Density and fixed acidity have one of the highest positive correlations (r = 0.67), while fixed acidity and pH have the strongest negative correlation (r = -0.68). These relationships are explored further in the sections below.
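Picking the strongest pair out of a correlation table by eye is error-prone, so it can help to search the matrix directly. The sketch below uses four variables whose coefficients are copied from the red wine table above; the variable names `r_upper`, `idx` and `strongest` are my own.

```r
# Coefficients copied from the red wine correlation table above
# (a subset of four variables).
vars <- c("Fixed Acid", "Citric Acid", "Density", "pH")
r <- matrix(c( 1.00,  0.67,  0.67, -0.68,
               0.67,  1.00,  0.36, -0.54,
               0.67,  0.36,  1.00, -0.34,
              -0.68, -0.54, -0.34,  1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Blank out the diagonal and the duplicated lower triangle.
r_upper <- r
r_upper[lower.tri(r_upper, diag = TRUE)] <- NA

# Locate the pair with the largest absolute coefficient.
idx <- which(abs(r_upper) == max(abs(r_upper), na.rm = TRUE), arr.ind = TRUE)
strongest <- data.frame(var1 = vars[idx[, "row"]],
                        var2 = vars[idx[, "col"]],
                        r    = r_upper[idx])
print(strongest)
```

Comparing absolute values matters here: ranking by the raw coefficient would overlook strong negative relationships.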
Now let's explore the relationships in greater detail, beginning our analysis with the main variables of interest.

Relation between Quality and Alcohol

Revisiting Alcohol distribution for both types of wines

Scatter Plot for White wine (Quality vs. Alcohol)

Table for Quality vs. Alcohol of White wine

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
It is evident that samples with a higher alcohol percentage tend to receive better quality ratings from the wine experts. The relationship between alcohol content and quality looks roughly linear. The summary table and the regression line also show that, overall, quality improves as mean alcohol content increases.
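The per-rating summaries above have the layout that base R's `by()` prints when combined with `summary()`. A minimal sketch on a tiny made-up data frame follows; the `wine` name and all values are hypothetical and only illustrate the grouping step.

```r
# Tiny made-up sample; 'wine' and its values are illustrative only.
wine <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.1, 9.8, 10.5, 11.0, 10.9, 11.4, 12.3)
)

# Six-number summary of alcohol within each quality rating.
print(by(wine$alcohol, wine$quality, summary))

# Mean and median per rating as named vectors, handy for comparing ratings.
mean_by_quality   <- tapply(wine$alcohol, wine$quality, mean)
median_by_quality <- tapply(wine$alcohol, wine$quality, median)
print(round(mean_by_quality, 2))
print(median_by_quality)
```

Using `tapply()` alongside `by()` keeps the per-rating means and medians in a form that can be compared or plotted directly.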

Box Plot for White wine (Quality vs. Alcohol)

The box plot reveals another angle on the relationship between quality and alcohol. Both the mean and median alcohol levels drop from a quality rating of 3 to 5. For higher ratings (above 5), the mean and median alcohol levels then increase almost linearly.

Scatter Plot for Red wine (Quality vs. Alcohol)

Table for Quality vs. Alcohol of Red wine

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
The red wine data set has fewer observations but also shows a positive relation between alcohol and quality. Generally, the higher the alcohol content, the better the quality rating. The regression line and the summary table above are consistent with what is observed in the scatter plot.

Box Plot for Red wine (Quality vs. Alcohol)

Relationship between Quality and Taste for White wine

Plot for Quality vs. Taste of White wine

The plots suggest that most of the higher quality samples fall under the Dry to Medium Dry taste categories. Even though the ranges shift when taste is derived from pH, the major contributors to quality are still the samples on the dry side. A summary table for the result is shown below.

Table for Quality vs. Taste of White wine

## wqw$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.958   7.000   9.000 
## -------------------------------------------------------- 
## wqw$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.775   6.000   9.000 
## -------------------------------------------------------- 
## wqw$taste: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   5.000   6.000   5.553   6.000   8.000 
## -------------------------------------------------------- 
## wqw$taste: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6       6       6       6       6       6
NOTE: Category Sweet is an exception as it has only one instance.

Table for Quality vs. Taste (pH) of White wine

## wqw$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.886   6.000   9.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.907   6.000   9.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.787   6.000   8.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   5.000   5.000   5.341   6.000   6.000

Relationship between Quality and Taste for Red wine

Plot for Quality vs. Taste of Red wine

For red wine, the plots tell a similar story. Samples tend to taste drier than white wine, as is evident from the taste vs. quality histogram (there are no samples in the medium sweet and sweet ranges). Plotting taste due to pH again shows that the drier samples contribute more towards good quality. Quality for the different taste buckets is summarized below.

Table for Quality vs. Taste of Red wine

## wqr$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.637   6.000   8.000 
## -------------------------------------------------------- 
## wqr$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   5.000   6.000   5.526   6.000   6.000

Table for Quality vs. Taste (pH) of Red wine

## wqr$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.681   6.000   8.000 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.618   6.000   8.000 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.534   6.000   8.000

Alcohol relationship with pH

Distribution of pH for wine samples revisited

Scatter Plot for White wine (Alcohol vs. pH)

Correlation coefficient for White wine (Alcohol vs. pH)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$alcohol and wqw$pH
## t = 8.5601, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09374446 0.14893205
## sample estimates:
##       cor 
## 0.1214321
Although most of the white wine samples are concentrated in the pH range of 3.0-3.3, there seems to be a slight positive correlation between alcohol and pH (r of about 0.12). This relationship is shown by the correlation test and the regression line drawn on the scatter plot.
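The t statistic in the test output above follows directly from the correlation coefficient and the degrees of freedom, via t = r * sqrt(df) / sqrt(1 - r^2). As a quick check, the sketch below plugs in the values reported for white wine (alcohol vs. pH); the variable names are my own.

```r
# Values reported by the correlation test above for white wine (alcohol vs. pH).
r  <- 0.1214321
df <- 4896          # n - 2, so n = 4898 samples

# Test statistic for the null hypothesis that the true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(round(t_stat, 3))
print(p_value < 2.2e-16)
```

With nearly 4,900 samples even a small r gives a large t, which is why a weak correlation can still come with a very small p-value.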

Scatter Plot for Red wine (Alcohol vs. pH)

Correlation coefficient for Red wine (Alcohol vs. pH)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$alcohol and wqr$pH
## t = 8.397, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1582061 0.2521123
## sample estimates:
##       cor 
## 0.2056325
Most of the points are concentrated between a pH of 3.2-3.5 and an alcohol percentage of 9-11. Apart from a few outliers, there seems to be a weak positive correlation between the two variables (r of about 0.21), which the correlation test confirms. The line of best fit drawn on the scatter plot also shows a positive relationship between the two variables.

Variations in Alcohol with respect to Taste for White wine

Frequency Plot for Alcohol vs. Taste of White wine

Box Plot for Alcohol vs. Taste of White wine

The frequency plots show clearly that dry wine samples have more alcohol content while sweeter samples have less. The box plots are consistent with the frequency plots and follow a roughly linear trend: the lower the alcohol content, the sweeter the taste.

Table for Alcohol vs. Taste of White wine

## wqw$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   10.00   10.90   10.94   11.80   14.20 
## -------------------------------------------------------- 
## wqw$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.100   9.500   9.871  10.400  14.050 
## -------------------------------------------------------- 
## wqw$taste: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.500   8.800   9.100   9.407   9.600  13.000 
## -------------------------------------------------------- 
## wqw$taste: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.7    11.7    11.7    11.7    11.7    11.7
NOTE: Category ‘Sweet’ in taste vs. alcohol is an exception as it has only one instance.

Table for Alcohol vs. Taste (pH) of White wine

## wqw$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.90   10.80   10.89   11.89   14.20 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.30   10.00   10.26   11.00   14.05 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.100   9.800   9.981  10.500  14.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.700   8.800   9.500   9.622  10.100  12.400

Variations in Alcohol with respect to Taste for Red wine

Frequency Plot for Alcohol vs. Taste of Red wine

Box Plot for Alcohol vs. Taste of Red wine

For red wine, the plot colored by taste does not show the effect of alcohol on taste clearly, because there are very few samples in the ‘Medium Dry’ range. Plotting alcohol colored by taste.due.to.pH reveals a roughly linear trend: as alcohol percentage increases, taste gets sweeter.

Table for Alcohol vs. Taste of Red wine

## wqr$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.43   11.10   14.90 
## -------------------------------------------------------- 
## wqr$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.800   9.200   9.900   9.984  10.400  12.200

Table for Alcohol vs. Taste (pH) of Red wine

## wqr$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.29   11.00   14.90 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.70    9.50   10.30   10.41   11.00   14.00 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.233   9.850  10.800  10.960  11.700  14.000

Relationship between Quality and Sulphates

Sulphates revisited

Scatter Plot for White wine (Quality vs. Sulphates)

Box Plot for White wine (Quality vs. Sulphates)

Table and Correlation Coefficient for Quality vs. Sulphates of White wine

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2800  0.3800  0.4400  0.4745  0.5425  0.7400 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4700  0.4761  0.5400  0.8700 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2700  0.4200  0.4700  0.4822  0.5300  0.8800 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4800  0.4911  0.5500  1.0600 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4800  0.5031  0.5800  1.0800 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4600  0.4862  0.5850  0.9500 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.360   0.420   0.460   0.466   0.480   0.610
## 
##  Pearson's product-moment correlation
## 
## data:  wqw$sulphates and wqw$quality
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02571007 0.08156172
## sample estimates:
##        cor 
## 0.05367788
The relation between sulphates and quality for white wine is very weak: the correlation coefficient is only about 0.05, which is statistically significant (p = 0.000171) but negligible in size. Mean sulphate content rises only slightly from rating 3 to 7 and then drops for ratings 8 and 9, as observed in the box plot. The median sulphate content (as per the summary table) is almost the same for all quality buckets (0.44-0.48). This observation matches the low correlation coefficient and the regression line, which is almost parallel to the x axis.

Scatter Plot for Red wine (Quality vs. Sulphates)

Box Plot for Red wine (Quality vs. Sulphates)

Table and Correlation Coefficient for Quality vs. Sulphates of Red wine

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000
## 
##  Pearson's product-moment correlation
## 
## data:  wqr$sulphates and wqr$quality
## t = 10.3798, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
Red wine samples have a stronger, though still weak, correlation between sulphates and quality than white wine (r of about 0.25 versus 0.05). As is evident from the box plots, regression line and scatter plot, an increase in sulphates per sample is on average associated with a better quality rating. The summary table shows the same: mean and median sulphate content increase with the quality rating, in line with the positive correlation coefficient.
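One way to compare the two coefficients is to square them, since r^2 is the share of variance that a straight-line fit accounts for. The sketch below applies this to the two Pearson coefficients reported above for quality vs. sulphates; the variable names are my own.

```r
# Pearson coefficients reported above for quality vs. sulphates.
r <- c(white = 0.05367788, red = 0.2513971)

# Squaring r gives the share of variance in one variable that a
# straight-line fit on the other accounts for, expressed here in percent.
shared_variance_pct <- round(100 * r^2, 2)
print(shared_variance_pct)
```

On this reading, sulphates account for well under 1% of the variation in white wine quality and roughly 6% for red wine, so neither is a strong predictor on its own.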

Variations in Quality due to pH for White wine

Scatter Plot for White wine (Quality vs. pH)

Box Plot for White wine (Quality vs. pH)

Table and Correlation Coefficient for Quality vs. pH of White wine

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410
## 
##  Pearson's product-moment correlation
## 
## data:  wqw$pH and wqw$quality
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725
The relationship between quality and pH shows a slight linear trend. Mean and median pH values (as seen in the box plot and summary table) decrease slightly across the lower quality ratings 3-5 and then gradually increase, showing a slight contribution to the quality of the white wine samples. The overall slight positive trend is also shown by the regression line on the scatter plot and by the correlation coefficient (r of about 0.10).

Relation between Quality and Citric Acid for Red wine

Citric Acid Histogram for Red wine samples

Scatter Plot for Red wine (Quality vs. Citric Acid)

Box Plot for Red wine (Quality vs. Citric Acid)

Table and Correlation Coefficient for Quality vs. Citric Acid of Red wine

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200
## 
##  Pearson's product-moment correlation
## 
## data:  wqr$citric.acid and wqr$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
The scatter plot, regression line and the mean and median values on the box plot for citric acid show a gradual increase with quality; in other words, quality increases as citric acid content increases for the red wine samples. This is also reflected in the calculated correlation coefficient (r of about 0.23).
Let us explore interesting correlations between features other than the main features, i.e. alcohol and quality.

Correlation between Density and Residual Sugar for White wine

NOTE: I chose the variables that had the strongest correlation, i.e. density and residual sugar for white wine and density and fixed acidity for red wine.

Scatter Plot for White wine (Density vs. Residual Sugar)

NOTE: In the scatter plot below, scales are limited to exclude outliers.
## Warning in loop_apply(n, do.ply): Removed 3 rows containing missing values
## (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 3 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 13 rows containing missing
## values (geom_path).

Correlation coefficient for White wine (Density vs. Residual Sugar)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$density and wqw$residual.sugar
## t = 107.8749, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665
The scatter plot and correlation coefficient clearly show a strong relation between density and residual sugar for white wine samples. A correlation coefficient of r = 0.84 (r^2 of about 0.70) suggests that density and residual sugar can be related by a linear equation, which is consistent with the regression line.
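The correlation coefficient r and the R-squared of a straight-line fit are easy to confuse. For a linear model with a single predictor, R-squared equals the square of Pearson's r. The sketch below demonstrates this on synthetic data (names and values are made up, not the wine data) and then squares the coefficient reported above.

```r
# Synthetic data; names and values are made up for illustration.
set.seed(1)
sugar   <- runif(300, 1, 20)
density <- 0.99 + 0.0004 * sugar + rnorm(300, sd = 0.001)

r   <- cor(density, sugar)
fit <- lm(density ~ sugar)

# For a one-predictor linear model, R-squared equals the squared Pearson r.
print(c(r = r, r_squared = r^2, lm_r_squared = summary(fit)$r.squared))

# Applied to the coefficient reported above for white wine.
print(round(0.8389665^2, 3))
```

So a coefficient of 0.84 corresponds to a linear fit that accounts for about 70% of the variance, not 84%.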

Relationship between Density and Fixed Acidity for Red wine

Scatter Plot for Red wine (Density vs. Fixed Acidity)

Correlation coefficient for Red wine (Density vs. Fixed Acidity)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$density and wqr$fixed.acidity
## t = 35.8771, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473
The scatter plot for the red wine data has a wide spread of points but shows a fairly strong positive correlation between density and fixed acidity (r of about 0.67). This is also evident from the regression line and the correlation coefficient calculated above.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Relationships observed in this section are as follows:
Quality with Alcohol: Scatter plots, regression lines and box plots were created to identify this relation. Quality and Alcohol, the main features in both data sets, showed moderate positive correlations of 0.44 for white wine and 0.48 for red wine. The scatter plots and regression lines showed the bigger picture, i.e. a positive correlation between Quality and Alcohol, while the box plots added depth, showing that alcohol levels drop across the lower quality ratings (3-5) and then increase almost linearly after quality level 5.
Alcohol relationship with pH: For white wine, samples were concentrated in the pH range of 3.0-3.3 and mostly below an alcohol percentage of 10.5. Heavy scattering made it difficult to judge the extent of correlation between the two variables, so the scatter plot was accompanied by a regression line and the correlation coefficient was calculated; it came out low but positive (r of about 0.12), showing a weak relation. For red wine, most of the points were concentrated between a pH of 3.2-3.5 and an alcohol percentage of 9-11. Visually there seemed to be some correlation between these variables, which was confirmed by the regression line and the calculated correlation coefficient of about 0.21.
Alcohol with respect to Taste and Taste due to pH: White wine samples that taste on the dry side tend to have more alcohol, while sweeter tasting samples have less. For red wine, samples that are dry have lower alcohol content than the sweeter ones. This observation is evident from the box plots in the Bivariate Plots section.
Quality vs. Sulphates: Box plots, scatter plots and regression lines were created to determine the relation between quality and sulphates. White wine samples had a much weaker correlation than red wine samples, as confirmed by the calculated correlation coefficients (r of about 0.05 versus 0.25). For white wine, sulphate content increased slightly from quality rating 3 to 7 and dropped after that. Red wine samples had a stronger correlation, with higher average sulphate content per sample going along with higher quality.
Quality with pH for White wine: The relation between quality and pH for white wine samples looked roughly parabolic, i.e. when the mean and median pH values are plotted for each quality rating bucket, pH decreased up to a quality rating of 5 (the minimum) and then gradually increased. The overall positive trend was also reflected in the calculated correlation coefficient.
Quality with Citric Acid for Red wine: The scatter plot and the regression line showed a close to linear relationship between quality and citric acid for red wine samples. The box plots and the correlation coefficient further made it clear that quality generally increases as the average citric acid content per sample of red wine increases.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There were interesting correlations observed between other features in the two data sets. The features showing highest correlation coefficient for each of the wine type were chosen:
Correlation between Density and Residual Sugar for White wine: A scatter diagram and regression line were plotted and the correlation coefficient was calculated to view the relation between density and residual sugar for white wine. The correlation was observed to be very strong (r = 0.84).
Relationship between Density and Fixed Acidity for Red wine: Again, density and fixed acidity were plotted against each other using a scatter plot and regression line. Most of the points were observed to be in the density range of 0.99 - 1.00 g/cm^3 for red wine. Along with the plots, the correlation coefficient between density and fixed acidity was computed, which showed a fairly strong correlation for the red wine samples (r = 0.67).

What was the strongest relationship you found?

The strongest relationship observed was between density and residual sugar for white wine samples (r = 0.84).
Other notable correlations are as follows:
- Density vs. Fixed Acidity for red wine (r = 0.67)
- Quality with Alcohol for both data sets (r = 0.48 [Red] and r = 0.44 [White])

Multivariate Plots Section

This section covers exploring relationships between multiple variables of the two wine data sets. Let us start with the main features first.

Alcohol with Quality

Density Plot of Alcohol with Quality for White wine

The density plot shows clearly that samples with a lower alcohol percentage lead the wine experts to give lower ratings, and that higher alcohol content, especially above 11% (with an exception in quality rating 9), generally leads them to give higher ratings such as 6, 7, 8 or 9.
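What a density plot per rating conveys can also be summarized numerically by locating the peak of each curve. The sketch below does this with base R's `density()` on synthetic data; the `demo` data frame and its values are invented for illustration and only loosely resemble the alcohol ranges discussed above.

```r
# Synthetic alcohol values for three quality ratings (illustrative only).
set.seed(7)
demo <- data.frame(
  quality = rep(c(5, 6, 7), each = 150),
  alcohol = c(rnorm(150, 9.8, 0.7), rnorm(150, 10.6, 0.9), rnorm(150, 11.4, 0.9))
)

# Kernel density estimate per rating; the peak marks the most common
# alcohol level within that rating.
peak_alcohol <- sapply(split(demo$alcohol, demo$quality), function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
})
print(round(peak_alcohol, 1))
```

Comparing the peak locations across ratings gives a compact way to describe the shift between curves that the density plot shows visually.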

Density Plot of Alcohol with Quality for Red wine

The density plot for red wine shows a similar trend. The lower the percentage of alcohol, the lower the rating given to the sample, and vice versa.
Let us introduce another variable and look for more interesting patterns.

Alcohol and Quality with Taste for White wine

Box Plot for White wine (Quality vs. Alcohol) with Taste

Facet wrapping by taste does show an overall positive relationship between quality and alcohol. For ‘Dry’ tasting samples alcohol content decreases between ratings 3-5 and almost linearly increases for ratings 5 and above. ‘Medium Dry’ taste bucket shows a similar trend as the ‘Dry’ bucket with the exception of rating 9. There are fewer samples in the ‘Medium Sweet’ bucket compared to the other two but the trend observed is positive. Taste bucket for ‘Sweet’ is an exception as there was only one sample which tasted sweet.

Box Plot for White wine (Quality vs. Alcohol) with Taste due to pH

Bringing pH into the picture shows similar results, as expected. Again with the exception of the ‘Sweet’ taste bucket, the remaining buckets show fluctuation but an overall decrease in alcohol content across the lower quality ratings 3-5, followed by a linear increase from quality rating 5 onwards.

Alcohol and Quality with Taste for Red wine

Box Plot for Red wine (Quality vs. Alcohol) with Taste

As there are fewer samples in the red wine data set than in the white wine data set, plotting alcohol vs. quality with taste does not tell us much. The ‘Dry’ bucket shows a fluctuating trend for alcohol from rating 3-5 and a linear increase for higher quality ratings. Below is a plot using taste due to pH.

Box Plot for Red wine (Quality vs. Alcohol) with Taste due to pH

Shifting to pH, we only have three taste buckets. For ‘Dry’ taste we see a linear positive trend, i.e. increase in alcohol content improves the quality rating. For ‘Medium Dry’ and ‘Medium Sweet’, alcohol content fluctuates slightly for lower quality ratings and then starts to increase as the quality rating gets better.

pH with Quality

Density Plot of pH with Quality for White wine

Plotting pH shows that (apart from quality rating 3, where pH covers a wide range) most of the mid-range quality ratings are given to samples with a pH of 3.0-3.2. For quality ratings 8 and 9 there is a slight increase in the pH range, which tells us that pH plays a role in the higher rated samples of white wine.

Density Plot of pH with Quality for Red wine

In contrast to white wine, higher ranges of pH per sample for red wine tend to receive poor ratings (the shift in the lower rating curves towards higher pH) and vice versa.

Relation of Alcohol and pH with Quality for White wine

Scatter Plot for White wine (Alcohol vs. pH) with Quality

Again, to analyze the trend here I have selected only the quality rating buckets with the most samples. Samples with a lower percentage of alcohol and slightly higher pH tend to receive lower quality ratings. On the other hand, samples with higher alcohol content (greater than 11%) and a stable pH (e.g. 3.2) receive good quality ratings from the experts.

Relation of Alcohol and pH with Quality for Red wine

Scatter Plot for Red wine (Alcohol vs. pH) with Quality

Applying the same quality bucket configuration to the red wine samples reveals similar results. With lower alcohol content and pH on the higher side, experts tend to give lower ratings. In contrast, higher alcohol content with pH kept under 3.3 leads the experts to rate the samples highly.

Sulphates and Quality

Density Plot of Sulphates with Quality for White wine

For white wine, sulphate content does not seem to affect the quality rating much. The exception is the curve for rating 9, which has a spike close to 0.6 g/dm3; apart from that, most of the quality rating curves overlap.

Density Plot of Sulphates with Quality for Red wine

Here things look different from the white wine samples. Red wine samples that receive high quality ratings do have an excess of sulphates, as can be seen from the curves for ratings 7 and 8. Their peaks stand apart from the rest at a sulphate content of 0.7 g/dm3 or more.

Sulphates and Alcohol with Quality for White wine

Scatter Plot for White wine (Sulphates vs. Alcohol) with Quality

The plot for alcohol and sulphates shows samples binned by quality rating. It is difficult to find a trend because of the number of outliers near the edges. Let's look at the zoomed version in the plot below.

Scatter Plot for White wine (Sulphates vs. Alcohol) with Quality [zoomed]

## Warning in loop_apply(n, do.ply): Removed 700 rows containing missing
## values (geom_point).

I filtered the quality buckets so that only quality ratings 5-7 (the majority of the points) are shown. The plotted points are mostly scattered and do not show a clear trend in how sulphates affect quality (in accordance with the density plots in the section above).
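The "Removed N rows containing missing values (geom_point)" warnings in this section are what ggplot2 emits when points fall outside the axis limits set on a scale, which is most likely how the zoomed plots were produced. The sketch below uses base R to count how many rows a given zoom window would exclude. The data and the limits `x_lim` and `y_lim` are hypothetical, since the report does not state the limits it used.

```r
# Synthetic stand-in for the white wine data (assumption: illustrative values).
set.seed(2)
wine <- data.frame(
  quality   = sample(3:9, 500, replace = TRUE),
  alcohol   = runif(500, 8, 14),
  sulphates = rlnorm(500, meanlog = log(0.5), sdlog = 0.3)
)

# Hypothetical zoom window; not the limits used in the report.
x_lim <- c(0.3, 0.8)  # sulphates axis
y_lim <- c(9, 13)     # alcohol axis

# A row is kept only if it lies inside the window on both axes.
inside <- wine$sulphates >= x_lim[1] & wine$sulphates <= x_lim[2] &
          wine$alcohol   >= y_lim[1] & wine$alcohol   <= y_lim[2]

# Number of rows the zoom excludes, i.e. the N a warning would report.
n_dropped <- sum(!inside)

# Keep only the quality buckets with the most samples (5-7) inside the window.
zoomed <- wine[inside & wine$quality %in% 5:7, ]

print(n_dropped)
print(nrow(zoomed))
```

In ggplot2 itself, `coord_cartesian()` zooms the view without dropping rows, whereas limits set on the scales discard out-of-range points before drawing, which is why the warnings appear.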

Sulphates and Alcohol with Quality for Red wine

Scatter Plot for Red wine (Sulphates vs. Alcohol) with Quality

NOTE: Axes scaled to avoid outliers and better represent the data.
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).

The scatter plot for red wine samples is fairly straightforward and shows some degree of linearity between alcohol and sulphate content. Higher alcohol percentage and more sulphate content (around 0.8-1.0 g/dm3) go with higher quality ratings (greater than 6). The lower quality ratings (3-5) mostly occur in the region with low sulphate content and low alcohol percentage.

Residual Sugar with Quality

Density Plot of Residual Sugar with Quality for White wine

In my opinion, residual sugar plays an important part in how the experts decide the quality rating of a particular wine sample. The plot suggests that most of the highly rated samples have less than 5 g/dm3 of residual sugar. For the remaining samples, quality ratings fluctuate across the residual sugar range of 10-20 g/dm3.

Density Plot of Residual Sugar with Quality for Red wine

Most of the samples, regardless of the quality rating they receive, have a residual sugar level of less than 4 g/dm3. For samples with more than 4 g/dm3 of residual sugar, ratings are mixed, i.e. they do not follow a clear pattern.

Residual Sugar and Alcohol with Quality for White wine

Scatter Plot for White wine (Residual Sugar vs. Alcohol) with Quality

NOTE: A few outliers were excluded for better analysis of the data.
## Warning in loop_apply(n, do.ply): Removed 5 rows containing missing values
## (geom_point).

Most of the points in the scatter plot are concentrated below 5 g/dm3 of residual sugar, as seen in the density plot. The scatter plot also shows that samples with more sugar content do receive higher quality ratings.

Residual Sugar and Alcohol with Quality for Red wine

Scatter Plot for Red wine (Residual Sugar vs. Alcohol) with Quality [zoomed]

## Warning in loop_apply(n, do.ply): Removed 129 rows containing missing
## values (geom_point).

The zoomed version of the scatter plot for red wine does show a trend. Samples that get higher quality ratings have residual sugar under 2.5 g/dm3 and alcohol content above 11%. Samples with mid-range quality ratings (5-6) have more residual sugar and an alcohol percentage of less than 10% on average.

Acidic Content and Quality

Total Acidic Content with Quality (White wine on left and Red wine on right)

Citric Acid: For white wine, almost all of the samples have citric acid content in the range 0-0.6 g/dm3, and quality ratings vary within this bracket. For red wine, all samples have citric acid below 1 g/dm3; samples with high quality ratings have more than 0.3 g/dm3, while those with lower ratings have less than 0.25 g/dm3, as seen in the density plot.

Acidic Content and Alcohol with Quality

Scatter plot for Total Acidic Content and Alcohol with Quality (White wine on left and Red wine on right)

NOTE: Outliers from all the acidic content plots were carefully observed and excluded for better analysis.
## Warning in loop_apply(n, do.ply): Removed 4 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 9 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 19 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 31 rows containing missing
## values (geom_point).

White wine: The general trend is that an increase in acidic content does not play an important part in improving quality. However, samples with higher volatile acidity do get lower ratings.
Red wine: There is a definite trend here. Samples with a higher alcohol percentage and a higher level of any acidic content (especially fixed acidity and citric acid) tend to get higher quality ratings, and vice versa.

Chlorides and Quality

Density Plot of Chlorides with Quality for White wine

Salts in drinks should be balanced. The plot shows that white wine samples with more than 0.05 g/dm3 of salts get mid-range to lower-range quality ratings from the experts, while balanced samples get better ratings (7-9). One or more samples clearly receive the lowest rating of 3 where the salt content is unusually high (around 0.24 g/dm3).

Density Plot of Chlorides with Quality for Red wine

Most of the samples have chloride levels under 0.1 g/dm3, and quality is almost independent of chlorides, as the rating curves overlap with each other. Again, the curve for rating 3 shows variations, indicating that some samples were excessively salty.

Chlorides and Alcohol with Quality

Scatter Plot for (Chlorides vs. Alcohol) with Quality [Facet Wrap]

## Warning in loop_apply(n, do.ply): Removed 1 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 9 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 3 rows containing missing values
## (geom_point).

I tried a different approach here and used the combination of the two data sets. Chloride at moderate levels (i.e. less than 0.15 g/dm3) does not appear to affect the quality. It is only when saltiness increases that samples start to get lower ratings, for both wine types.

Sulphur Dioxide and Quality

SO2 with Quality (White wine on left and Red wine on right)

Most of the white wine samples that get good quality ratings contain total SO2 in the range of 0-150 mg/dm3 and free SO2 in the range of 25-50 mg/dm3. There are samples that have more SO2 and ultimately get poor ratings from the experts. On the other hand, the majority of the red wine samples that get higher ratings have a much lower amount of SO2. Looking at the ratings, it seems that the ideal range of SO2 for red wine is less than 50 mg/dm3 for total SO2 and less than 20 mg/dm3 for free SO2.

Sulphur Dioxide and Alcohol with Quality

Scatter plot for Total Sulphur Dioxide and Alcohol with Quality (White wine on left and Red wine on right)

NOTE: Again, outliers for SO2 content plots are excluded for better analysis.
## Warning in loop_apply(n, do.ply): Removed 6 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 9 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 10 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).

White wine: There does not seem to be a strong relationship between the variables, but to some extent samples with more alcohol and a controlled amount of total and free SO2 tend to get higher ratings.
Red wine: Here there appears to be a good relationship between the 3 variables. Especially for free SO2, it can clearly be observed that an increase in both alcohol and free SO2 leads the experts to give higher ratings.

Density and Quality

Density Plot of Density with Quality for White wine

Samples whose density is close to 0.99 g/cm3 tend to receive high quality ratings. Ratings start to decrease as density increases and approaches 1.00 g/cm3.

Density Plot of Density with Quality for Red wine

For red wine as well, most samples with a density close to or less than 0.995 g/cm3 get a higher rating. As density increases, the experts seem to find that the samples lose their wow factor, and they give lower ratings.

Density and Alcohol with Quality

Scatter Plot for (Density vs. Alcohol) with Quality [Facet Wrap]

NOTE: Few density outliers removed from the plots for better representation.
## Warning in loop_apply(n, do.ply): Removed 3 rows containing missing values
## (geom_point).

Density and Residual Sugar with Quality for White wine

Scatter Plot for (Density vs. Residual Sugar) with Quality

NOTE: Scatter plot modified to exclude outliers for ease of visual representation.
## Warning in loop_apply(n, do.ply): Removed 5 rows containing missing values
## (geom_point).

Density and residual sugar have the highest correlation among the white wine variables, and this can be seen in the plot. With a reference line added, a divide can be seen. As residual sugar increases, samples whose density stays below the reference line get high ratings (7-9), while samples whose density rises above the line get lower ratings.
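The reference-line comparison can be expressed as a simple classification: each sample is labelled by whether its density lies above or below the line at its residual sugar level, and the mean quality is then compared between the two groups. This is a minimal sketch on synthetic data. The intercept and slope of the line are hypothetical, because the report does not give the line's equation, and the synthetic quality values are random, so no trend should be read from this output.

```r
# Synthetic stand-in for the white wine data (assumption: illustrative values).
set.seed(3)
n <- 400
residual.sugar <- runif(n, 1, 20)
wine <- data.frame(
  residual.sugar = residual.sugar,
  density        = 0.991 + 0.0004 * residual.sugar + rnorm(n, 0, 0.001),
  quality        = sample(3:9, n, replace = TRUE)
)

# Hypothetical reference line: density = intercept + slope * residual.sugar.
line_intercept <- 0.991
line_slope     <- 0.0004

# Label each sample by its position relative to the reference line.
line_density <- line_intercept + line_slope * wine$residual.sugar
wine$side    <- ifelse(wine$density > line_density, "above", "below")

# Mean quality rating on each side of the line.
quality_by_side <- aggregate(quality ~ side, data = wine, FUN = mean)
print(quality_by_side)

# Correlation between the two plotted variables.
sugar_density_cor <- cor(wine$residual.sugar, wine$density)
print(round(sugar_density_cor, 2))
```

On the real data, a clearly higher mean quality in the "below" group would support the divide described above.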

Density and Fixed Acidity with Quality for Red wine

Scatter Plot for (Density vs. Fixed Acidity) with Quality

The two variables are well correlated and show a definite trend for quality. On average, samples with higher fixed acidity (more than 8 g/dm3) and a density of less than or equal to 0.996 g/cm3 receive higher quality ratings. As fixed acidity goes below 8 g/dm3 and density approaches 1.00 g/cm3, samples start to receive poor to mid-range ratings.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Relationships explored in the multivariate section are as follows:
Alcohol and Quality with Taste: Density and scatter plots were used to explore the relationship between these features. For both wine types, density plots showed that an increase in the percentage of alcohol came with an increase in the quality rating of the samples. Taste due to pH was the additional variable added to the equation. For white wine samples, as alcohol increased the samples tended to fall in the ‘Dry’ to ‘Medium Dry’ range and the quality rating ultimately increased. For red wine, on the other hand, introducing taste due to pH did not reveal much of a trend or relation between these variables.
NOTE: Here samples for white wine in particular strengthen each other in terms of the variables alcohol, taste and quality.
Relation of Alcohol and pH with Quality: In general, according to the density plot for white wine, mid to low range quality ratings did not really depend on pH, while increasing pH came with an improvement in the ratings, especially for ratings 8 and 9. Red wine samples told a different story: the higher the pH of a sample, the lower the rating it got from the experts, and vice versa. Scatter plots showed similar results for both wine types. Samples with less alcohol and pH on the higher side tended to receive poor ratings, while samples with alcohol greater than 11% and a stable pH of around 3.2-3.3 received high ratings.
NOTE: I think pH, alcohol and quality strengthened each other for white wine samples.
Alcohol and Quality and Sulphates: For white wine samples, on average sulphate content did not affect the quality rating as such; red wine samples, however, did show a trend. Samples with sulphate content around 0.7 g/dm3 received higher ratings than those with sulphates around 0.5 g/dm3. For the scatter plots of white wine with alcohol, quality and sulphates it was difficult to pinpoint a particular trend. For red wine, there was a definite positive trend, i.e. the higher the alcohol and sulphate content per sample, the higher the ratings they got from the wine experts.
NOTE: Sulphate and alcohol played a part in improving the quality rating, hence strengthening each other in case of red wine samples.
Relationship of Residual Sugar and Alcohol with Quality: The density plots for residual sugar with curves colored by quality rating did not really show a clear trend. Most of the quality curves are stacked in the same ranges of residual sugar for both wine types. The scatter plot for white wine did show a relation between the variables, i.e. samples with more residual sugar did receive higher ratings. The zoomed version of the scatter plot for red wine samples showed that, for higher ratings, the alcohol content was observed to be higher and the residual sugar content was observed to be lower, and vice versa.
NOTE: In my opinion, variables residual sugar, quality and alcohol do strengthen each other to some extent for samples of red wine.

Were there any interesting or surprising interactions between features?

Interaction of the top most highly correlated variables are as follows:
Density and Residual Sugar with Quality for White wine: Density and residual sugar for white wine showed the highest correlation with each other. I thought of investigating the trend of these two variables with our main variable of interest. I put all three on a scatter plot colored by quality and drew an imaginary line to roughly split the ratings. The line showed a clear divide: when residual sugar increased and density stayed within the threshold set by the line, the quality of the samples was observed to improve. On the other hand, for samples with increasing residual sugar and density above the line threshold, experts gave lower quality ratings.
Density and Fixed Acidity with Quality for Red wine: For red wine, fixed acidity and density showed the highest correlation. I used these two with quality to see if there was a trend. Imaginary line for reference was again drawn here and there was a trend observed. For increasing fixed acidity per sample, density above the line threshold got good reviews i.e. high quality rating and density values less than the threshold received average to poor reviews.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = combined_wq)
## m2: lm(formula = quality ~ alcohol + pH, data = combined_wq)
## m3: lm(formula = quality ~ alcohol + pH + sulphates, data = combined_wq)
## m4: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5), 
##     data = combined_wq)
## m5: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide, data = combined_wq)
## m6: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)), data = combined_wq)
## m7: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity, 
##     data = combined_wq)
## m8: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid, data = combined_wq)
## m9: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)), data = combined_wq)
## m10: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)), 
##     data = combined_wq)
## m11: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar, data = combined_wq)
## m12: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)), data = combined_wq)
## m13: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)) + s.a.ratio, 
##     data = combined_wq)
## m14: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)) + s.a.ratio + 
##     taste, data = combined_wq)
## m15: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)) + s.a.ratio + 
##     taste + fixed.acidity, data = combined_wq)
## 
## ===================================================================================================================================================================================================================
##                                     m1          m2          m3          m4          m5          m6          m7          m8          m9          m10         m11         m12         m13         m14         m15    
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                       2.405***    2.982***    2.987***    3.023***    2.277***   -1.983***   -1.028**    -0.927*     -1.316**    -2.164***   -2.466***    -4.232***   -4.160***   -4.748***   -3.973***
##                                  (0.086)     (0.204)     (0.204)     (0.204)     (0.209)     (0.396)     (0.388)     (0.401)     (0.404)     (0.407)     (0.406)      (0.574)     (0.594)     (0.605)     (0.747)  
## alcohol                           0.325***    0.328***    0.329***    0.329***    0.348***    0.342***    0.323***    0.323***    0.381***    0.352***    0.303***     0.261***    0.261***    0.246***    0.238***
##                                  (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.011)     (0.012)     (0.013)      (0.016)     (0.016)     (0.017)     (0.018)  
## pH                                           -0.189**    -0.241***   -0.267***   -0.195**    -0.187**     0.073       0.055       0.024      -0.044       0.101        0.360***    0.359***    0.402***    0.417***
##                                              (0.061)     (0.062)     (0.063)     (0.062)     (0.061)     (0.061)     (0.064)     (0.063)     (0.063)     (0.065)      (0.088)     (0.088)     (0.090)     (0.090)  
## sulphates                                                 0.284***    0.385***    0.551***    0.681***    0.821***    0.834***    0.693***    0.547***    0.767***     0.829***    0.828***    0.844***    0.850***
##                                                          (0.066)     (0.076)     (0.076)     (0.076)     (0.074)     (0.075)     (0.077)     (0.078)     (0.082)      (0.083)     (0.083)     (0.083)     (0.083)  
## I(sulphates^5)                                                       -0.038**    -0.043**    -0.052***   -0.050***   -0.051***   -0.043**    -0.034**    -0.040**     -0.039**    -0.039**    -0.039**    -0.038** 
##                                                                      (0.014)     (0.013)     (0.013)     (0.013)     (0.013)     (0.013)     (0.013)     (0.013)      (0.013)     (0.013)     (0.013)     (0.013)  
## free.sulfur.dioxide                                                               0.007***   -0.009***   -0.009***   -0.008***   -0.009***   -0.013***   -0.014***    -0.014***   -0.014***   -0.015***   -0.015***
##                                                                                  (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.001)      (0.001)     (0.001)     (0.001)     (0.001)  
## I(free.sulfur.dioxide^(1/10))                                                                 3.433***    2.517***    2.499***    2.685***    4.660***    4.717***     4.700***    4.705***    4.799***    4.821***
##                                                                                              (0.272)     (0.268)     (0.269)     (0.269)     (0.319)     (0.317)      (0.317)     (0.317)     (0.316)     (0.316)  
## volatile.acidity                                                                                         -1.247***   -1.269***   -1.429***   -1.558***   -1.380***    -1.396***   -1.399***   -1.397***   -1.300***
##                                                                                                          (0.063)     (0.067)     (0.071)     (0.071)     (0.074)      (0.074)     (0.074)     (0.074)     (0.092)  
## citric.acid                                                                                                          -0.070      -0.206**    -0.131       0.013       -0.116      -0.118      -0.112      -0.045   
##                                                                                                                      (0.072)     (0.075)     (0.074)     (0.076)      (0.081)     (0.081)     (0.081)     (0.090)  
## I(log10(density))                                                                                                                78.845***   60.937***  -44.217*    -132.737*** -132.288*** -151.958*** -163.429***
##                                                                                                                                 (11.236)    (11.239)    (17.213)     (26.631)    (26.649)    (28.208)    (28.938)  
## I(log10(total.sulfur.dioxide))                                                                                                               -0.598***   -0.758***    -0.763***   -0.767***   -0.798***   -0.799***
##                                                                                                                                              (0.053)     (0.056)      (0.056)     (0.057)     (0.057)     (0.057)  
## residual.sugar                                                                                                                                            0.028***     0.043***    0.050**     0.024       0.033   
##                                                                                                                                                          (0.004)      (0.005)     (0.016)     (0.017)     (0.018)  
## I(log10(combined.acidity))                                                                                                                                             1.267***    1.196***    1.600***   -0.092   
##                                                                                                                                                                       (0.291)     (0.326)     (0.348)     (1.017)  
## s.a.ratio                                                                                                                                                                         -0.057       0.322*      0.262   
##                                                                                                                                                                                   (0.118)     (0.132)     (0.136)  
## taste: .L                                                                                                                                                                                      0.179       0.183   
##                                                                                                                                                                                               (0.559)     (0.559)  
## taste: .Q                                                                                                                                                                                      0.550       0.555   
##                                                                                                                                                                                               (0.408)     (0.407)  
## taste: .C                                                                                                                                                                                      0.365       0.367*  
##                                                                                                                                                                                               (0.187)     (0.187)  
## fixed.acidity                                                                                                                                                                                              0.089   
##                                                                                                                                                                                                           (0.050)  
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                            0.197       0.199       0.201       0.202       0.223       0.242       0.284       0.285       0.290       0.304       0.311       0.313       0.313       0.318       0.318 
## adj. R-squared                       0.197       0.198       0.200       0.201       0.222       0.241       0.284       0.284       0.289       0.303       0.309       0.311       0.311       0.316       0.316 
## sigma                                0.782       0.782       0.781       0.780       0.770       0.761       0.739       0.739       0.736       0.729       0.726       0.725       0.725       0.722       0.722 
## F                                 1597.641     804.749     544.025     410.377     372.584     344.582     368.568     322.613     294.371     282.976     265.644     245.759     226.845     188.468     177.624 
## p                                    0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000 
## Log-likelihood                   -7623.404   -7618.549   -7609.411   -7605.540   -7518.180   -7439.467   -7250.383   -7249.909   -7225.344   -7161.667   -7129.474   -7120.000   -7119.884   -7096.585   -7095.014 
## Deviance                          3975.734    3969.796    3958.645    3953.930    3849.017    3756.874    3544.442    3543.924    3517.226    3448.953    3414.943    3404.997    3404.876    3380.543    3378.909 
## AIC                              15252.809   15245.098   15228.821   15223.079   15050.360   14894.933   14518.766   14519.817   14472.687   14347.334   14284.948   14267.999   14269.768   14229.170   14228.028 
## BIC                              15273.146   15272.214   15262.717   15263.754   15097.814   14949.166   14579.778   14587.608   14547.257   14428.683   14373.076   14362.907   14371.454   14351.194   14356.831 
## N                                 6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497     
## ===================================================================================================================================================================================================================
I created a linear model for predicting the quality of wine samples. The variables used in the model are a combination of original, derived, and log- or power-transformed variables. Although the final linear model only reached an r^2 of 0.32, I tried to include as many useful features as I could. Feature selection combined the correlation coefficients calculated in the previous sections with trial and error to check which transformation (log or power) gave better results.
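The AIC and BIC rows of the model table can be reproduced from the reported log-likelihoods, which makes it easier to judge whether the later terms are worth keeping. The sketch below uses only values copied from the table for m1, m14 and m15. The parameter counts `k` are my own count of the listed coefficients plus one for the residual variance, which is the convention R's `logLik` uses for `lm` fits.

```r
# Values copied from the model table above.
n      <- 6497
loglik <- c(m1 = -7623.404, m14 = -7096.585, m15 = -7095.014)

# Estimated parameters: number of coefficients in each model, plus 1 for the
# residual variance (m1: intercept + alcohol; m14: 17 terms; m15: 18 terms).
k <- c(m1 = 2 + 1, m14 = 17 + 1, m15 = 18 + 1)

# AIC penalises each parameter by 2, BIC by log(n).
aic <- 2 * k - 2 * loglik
bic <- k * log(n) - 2 * loglik

print(round(cbind(AIC = aic, BIC = bic), 3))

# Change from m14 to m15 (adding fixed.acidity).
print(round(c(delta_AIC = unname(aic["m15"] - aic["m14"]),
              delta_BIC = unname(bic["m15"] - bic["m14"])), 3))
```

The results agree with the table to within rounding. Adding fixed.acidity in m15 lowers AIC by about 1.1 but raises BIC by about 5.6, so under the stricter BIC penalty the extra term is not clearly justified, while r^2 stays at 0.318.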

Final Plots and Summary

Plot One

Description One

The first plot is chosen from the univariate section. It shows the frequency polygons for the distribution of alcohol for White and Red wine samples. The frequency polygons show that for most of the samples of both wine types, the alcohol percentage is between 9% and 13%. Apart from the difference in the number of samples, the plot suggests that both wine types have a similar alcohol distribution, with multiple peaks at 9.5, 10.1, 10.5 and 11.1% for white wine and at 9.5, 9.8, 10.5 and 10.9% for red wine samples. The shapes of the polygons are similar too, which reminds me of a geometric transformation with an approximate enlargement factor of 2.

Plot Two

Description Two

Plot two is a combination of quality vs. alcohol box plots for both wine types, taken from the bivariate section of this report. The stat summary function is used to plot the mean value of alcohol for each quality rating bucket, denoted by a green cross inside each box plot. For both wine types, mean and median alcohol levels for lower quality ratings (3-5) show a slight drop, but as we progress through higher quality buckets there is an increasing trend in the percentage of alcohol. There is no red wine sample that got a quality rating of 9 from the experts.
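The idea of overlaying the per-bucket mean on a box plot can be sketched in base R as follows. This is an illustration on synthetic data, not the report's plotting code (the report's plots appear to be drawn with ggplot2), and the data frame `wine` is a stand-in.

```r
# Synthetic stand-in for one wine data set (assumption: illustrative values).
set.seed(4)
wine <- data.frame(quality = sample(3:8, 300, replace = TRUE))
wine$alcohol <- rnorm(300, mean = 10.5, sd = 1)

# Draw to a null device so the sketch runs without opening a window or file.
pdf(NULL)

# One box per quality rating; boxes are placed at x = 1, 2, ... in rating order.
bp <- boxplot(alcohol ~ quality, data = wine,
              xlab = "Quality rating", ylab = "Alcohol (%)")

# Mean alcohol per rating, in the same order as the boxes.
mean_alcohol <- tapply(wine$alcohol, wine$quality, mean)

# Mark each mean with a green cross on top of its box.
points(seq_along(mean_alcohol), mean_alcohol, pch = 4, col = "green")

dev.off()

print(round(mean_alcohol, 2))
```

Showing the mean next to the median line makes skew within a bucket visible: where the cross sits above the median, the bucket has a tail towards higher alcohol.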

Plot Three

Description Three

The third plot is derived from the multivariate section. Frequency polygons and box plots have already been discussed in the previous plots, so I thought of including a scatter plot. The legend for quality is set so that an orangish tint corresponds to lower quality ratings and a greenish tint to higher quality ratings, and only the quality rating buckets with the most samples (5-7) are shown. Considering only these buckets helps in understanding the trend. The data subset used for this plot also excludes outliers for both pH and sulphates. I have used a combination of facet wrap and grid arrange to show a comparison between pH and sulphates.
White wine
For white wine, the scatter plots generally show that experts tend to give higher quality ratings as sulphate and alcohol content increase. The same holds for pH and alcohol: higher values of both go with higher quality ratings. Let us look at the summary statistics and correlation coefficients (r) as part of this investigation.
Summary of Sulphate, pH and Alcohol content for White wine
pH
## sub_q_wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## sub_q_wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## sub_q_wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820
SO4
## sub_q_wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2700  0.4200  0.4700  0.4822  0.5300  0.8800 
## -------------------------------------------------------- 
## sub_q_wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4800  0.4911  0.5500  1.0600 
## -------------------------------------------------------- 
## sub_q_wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4800  0.5031  0.5800  1.0800
Alcohol
## sub_q_wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## sub_q_wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## sub_q_wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20
Correlation (r) table of Sulphate, pH and Alcohol content for White wine
##             pH sulphates alcohol quality
## pH        1.00      0.15    0.12    0.10
## sulphates 0.15      1.00   -0.02    0.06
## alcohol   0.12     -0.02    1.00    0.45
## quality   0.10      0.06    0.45    1.00
We can see that with increasing quality rating, the mean and median values for pH, alcohol and sulphate content generally increase, which confirms what the scatter plots suggested. The correlation table also shows a positive relationship between quality and each of the other three variables, with alcohol the strongest (r = 0.45) and SO4 the weakest (r = 0.06). For the data available and the majority of the White wine samples, higher alcohol, pH and SO4 content is associated with higher quality ratings.
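The per-bucket summaries and the correlation table above have the shape of output from base R's by() and cor(). A minimal sketch of how such tables could be produced is given below, assuming a small made-up data frame (toy) in place of the real white wine subset; the values are illustrative only and are not taken from the data set.

```r
# Toy stand-in for the quality 5-7 subset; all values are made up.
toy <- data.frame(
  quality   = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  pH        = c(3.08, 3.16, 3.24, 3.08, 3.18, 3.28, 3.10, 3.20, 3.32),
  sulphates = c(0.42, 0.47, 0.53, 0.41, 0.48, 0.55, 0.41, 0.48, 0.58),
  alcohol   = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# Min, quartiles, median and mean of pH for each quality bucket.
ph_by_quality <- by(toy$pH, toy$quality, summary)
print(ph_by_quality)

# Pairwise Pearson correlation coefficients (r), rounded to two decimals.
# These are r values, so they can be negative; r^2 would be their square.
r_table <- round(cor(toy[, c("pH", "sulphates", "alcohol", "quality")]), 2)
print(r_table)
```

Swapping toy for the actual subset and repeating the by() call for sulphates and alcohol would give the remaining summary blocks.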
Red wine
The plot shows a slight positive correlation between SO4 and alcohol, and samples with higher values of both tend to receive higher quality ratings. On the other hand, samples with higher pH and lower alcohol receive lower ratings, while those with relatively low pH and higher alcohol content seem to receive higher ratings. The summary tables and correlation coefficients (r) follow for further investigation.
Summary of Sulphate, pH and Alcohol content for Red wine
pH
## sub_q_wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## sub_q_wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## sub_q_wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780
SO4
## sub_q_wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## sub_q_wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## sub_q_wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600
Alcohol
## sub_q_wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## sub_q_wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## sub_q_wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00
Correlation (r) table of Sulphate, pH and Alcohol content for Red wine
##              pH sulphates alcohol quality
## pH         1.00     -0.18     0.2   -0.01
## sulphates -0.18      1.00     0.1    0.24
## alcohol    0.20      0.10     1.0    0.50
## quality   -0.01      0.24     0.5    1.00
Mean and median values for alcohol and sulphate content generally increase with improving quality ratings, confirming what was observed in the scatter plots. In contrast, the mean and median values for pH fluctuate: pH increases from quality rating 5 to 6 and then decreases for quality rating 7. This fluctuation is reflected in a correlation that is essentially zero and very slightly negative (r = -0.01). For the majority of the Red wine samples, higher alcohol and SO4 content is associated with a higher quality rating, as seen in the correlation table, while pH has a negligible effect on quality.

Reflection

Analysis and Observations

The analysis of the two wine data sets started with the selection of main features. As the main objective of the project was to relate the quality rating of a wine sample to its chemical composition, ‘quality’ and the closely associated variable ‘alcohol’ were chosen as the two main features. The analysis revolved around these two. In addition to the main features, some of the other variables in the data sets were considered as supporting features.
In the univariate analysis section for the project, I had assumed that along with the main features, supporting features like residual sugar and acidic contents (fixed, volatile and citric acid) would play an important role in the analysis. I had also thought features like pH, density and sulphate content would be very useful in determining the quality of wine samples.
All feature selection was considered for both the wine types in general. Conducting extensive comparative analysis of the features in bivariate and multivariate sections revealed some interesting relationships and combinations that might affect the quality of wine samples.
The sequence of the analysis conducted and the successes/difficulties faced during the analysis of the two wine data sets is summarized below:
I started by analyzing the relationship between quality and alcohol. The correlation coefficient, regression line and scatter plot showed a positive relation between the two; however, the exact trend was difficult to pinpoint. I took an alternative approach and created box plots showing the mean alcohol level for each quality bucket. The box plots and the mean values showed that although the correlation was positive, it was not linear across all quality buckets. For both wine types, alcohol content decreased across the lower quality ratings, and from a quality rating of 5 onwards a positive linear trend was observed.
Evidence for this observation is given by the correlation coefficients below (I have not included this in an R chunk as it is for illustration purposes only). Subsetting by quality, both wine types are compared for the lower and upper buckets of quality ratings.

Comparison of r for subsets for White wine

with(wqw, cor.test(alcohol, quality))$estimate                       # 0.44
with(subset(wqw, quality < 5), cor.test(alcohol, quality))$estimate  # -0.06
with(subset(wqw, quality > 4), cor.test(alcohol, quality))$estimate  # 0.47

Comparison of r for subsets for Red wine

with(wqr, cor.test(alcohol, quality))$estimate                       # 0.48
with(subset(wqr, quality < 5), cor.test(alcohol, quality))$estimate  # 0.12
with(subset(wqr, quality > 4), cor.test(alcohol, quality))$estimate  # 0.52
Apart from the dipping trend in the relationship, there was an overall positive correlation.
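The subset comparison can be tried on any data frame with alcohol and quality columns. The sketch below uses a small made-up data frame (toy) in which alcohol dips across the low ratings and rises from 5 onwards; the helper r_of() is a hypothetical name introduced here for illustration and is not part of the original analysis.

```r
# Toy data, made up for illustration: a dip at low ratings, a rise from 5 upward.
toy <- data.frame(
  quality = c(3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8),
  alcohol = c(10.4, 10.2, 10.1, 10.3, 9.6, 9.9, 10.5, 10.8, 11.3, 11.6, 11.9, 12.2)
)

# cor.test() returns a test object; its $estimate element holds Pearson's r.
r_of <- function(d) unname(with(d, cor.test(alcohol, quality))$estimate)

r_all  <- r_of(toy)                       # all quality ratings
r_low  <- r_of(subset(toy, quality < 5))  # lower buckets only
r_high <- r_of(subset(toy, quality > 4))  # quality 5 and above

print(round(c(all = r_all, low = r_low, high = r_high), 2))
```

When the lower buckets run against the overall trend, the coefficient for the upper subset comes out larger than the one for the full data, which is the pattern reported above for both wine types.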
In addition to alcohol and quality, I also thought pH could help decide the quality of a particular wine sample. To explore the relationship I started by creating scatter plots with a linear regression line and calculating the correlation coefficient (r) between alcohol and pH. For both wine types the results were slightly positive. I then drew box plots to check the relation between quality and pH for the white wine samples. Although the correlation was poor, the trend resembled that of alcohol vs. quality (pH dropped for low quality samples and then increased very slightly for samples with high quality ratings). I continued this exploration in the multivariate section by creating density plots for pH with one curve per quality rating bucket. For both wine types, most samples fell in the pH range 3.0-3.4, and pH did not favor any particular quality rating, as most of the curves were clustered around that range. Finally, I created scatter plots of alcohol vs. pH with points colored by quality. With lower alcohol content and higher pH, experts tend to give lower ratings; with higher alcohol content and pH kept under control, they tend to rank the samples higher. The exercise around pH did not turn out as expected. I had thought that pH would play a very important part in determining quality, but to my surprise the relationship, although positive, was too weak to convince me that pH plays a decisive part.
Along with pH, I thought acid content would be an important factor too. Taking my analysis a step further, I plotted density curves in the multivariate section for each type of acidity with respect to each quality bucket. Density plots of fixed and volatile acidity for both wine types showed similar variations, with quality ratings being high for samples whose acidity levels are under control, i.e. around 6-8 g/dm^3 for fixed acidity and 0.3-0.4 g/dm^3 for volatile acidity. The citric acid plot revealed that red wine quality did get better with increasing acid content, while the white wine plots remained inconclusive. These plots were followed by scatter diagrams, which showed that acid content played a less important role for white wine than for red wine samples. Fixed acidity and citric acid seemed to be the main contributing factors towards quality for red wine.
Sugar is an integral part of any drink, and wine is no exception. Density plots and scatter diagrams revealed that white wine samples had more residual sugar than red wine samples. Quality ratings were mixed, and sugar content did seem to affect the quality of particular samples for both wine types.
To generalize the analysis, I have come up with the following equations that roughly relate some of the variables in the data sets to quality.
White wine: Quality ~ Alcohol + pH(to some extent) + Residual Sugar
Red wine: Quality ~ Alcohol + pH(to some extent) + Sulphates
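The two relations above are written in the style of R model formulas. As a rough sketch only, such a formula could be passed to lm() to obtain coefficients; this is an assumption about how the relation might be quantified and not something done in the report. The data frame toy below is made up, and treating the ordinal quality score as a numeric response is a simplification.

```r
# Toy data, made up for illustration; not taken from the wine data sets.
toy <- data.frame(
  quality        = c(5, 5, 6, 6, 7, 7, 5, 6, 7, 6),
  alcohol        = c(9.4, 9.8, 10.5, 10.9, 11.8, 12.3, 9.6, 10.2, 11.5, 10.7),
  pH             = c(3.10, 3.22, 3.15, 3.30, 3.25, 3.18, 3.05, 3.28, 3.35, 3.12),
  residual.sugar = c(8.1, 1.6, 6.9, 2.3, 5.4, 1.2, 12.0, 7.5, 3.1, 4.4)
)

# Same structure as the white wine relation: Quality ~ Alcohol + pH + Residual Sugar.
fit <- lm(quality ~ alcohol + pH + residual.sugar, data = toy)

# One intercept plus one coefficient per term on the right-hand side.
print(coef(fit))
```

For red wine the same call would use sulphates in place of residual.sugar.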

Limitations

One of the limitations of the data set that I observed was that it covers only one type of wine (Vinho Verde). If wine data from other wine types/manufacturers were combined with the existing data set, it could help generalize this analysis and apply it to wine from all over the world.

Questions and Future work

Additional variables could have been added to the data set for better analysis and prediction. I added two variables, ‘taste’ and ‘taste due to pH’, mentioned in the univariate analysis section, for better understanding and prediction of the quality of the wine samples. The data used to create the variables was based on an assumption for Riesling wine and was only used to add categorical variables. Based on what I analyzed using these variables, it would have been very useful if the original wine data sets had included a taste variable specific to Vinho Verde wines. So a suggestion for future work would be to enhance the data set by adding a variable that accounts for taste.

Struggles during the analysis

Some of the difficulties I faced while creating this report and coding in R are documented below:
Adding additional categorical variables
During the analysis for this project, I thought adding an extra categorical variable other than quality would be very helpful for finding interesting relationships between the variables. For that I kept searching and came across this url:
Although the table here was meant for Riesling wine, I thought it would be good if I could implement this in R and apply it to the data sets. Doing a bit of web searching on ‘adding categorical variables to a data set in R’, I came across the following url:
This helped to add the four categories of taste and taste due to pH. Then I realized, while coding the summary tables, that taste and taste due to pH had the data type character and that the summary function did not give useful output for them. I went back to the diamonds data set in R and checked the data types for clarity, color, cut, etc. These were all defined as ordered factors, which led me to do some searching and find a solution to this problem. I got to the following url, which helped me apply this to my newly created variables and use them in summary tables.
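The character-to-ordered-factor step can be sketched as follows. The four category labels below are hypothetical, since the report does not list the exact names used for taste; only base R functions are involved.

```r
# Hypothetical labels, ordered from least to most sweet; the real names may differ.
taste_levels <- c("dry", "medium dry", "medium sweet", "sweet")
taste <- c("dry", "sweet", "medium dry", "dry", "medium sweet", "dry")

# As a character vector, summary() only reports length, class and mode.
print(summary(taste))

# Converting to an ordered factor gives per-category counts and allows comparisons.
taste <- factor(taste, levels = taste_levels, ordered = TRUE)
print(summary(taste))

taste_counts <- table(taste)
print(taste_counts)
```

Passing levels explicitly fixes the order of the categories; without it, factor() would sort the labels alphabetically.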
Using Frequency Polygons
As mentioned in the univariate analysis section, the histograms for alcohol showed multi-modality when they were first created. It was hard to interpret the trend and compare the alcohol distribution for the two data sets, so I recalled that lesson 3 of ‘Data Analysis with R’ explained how to use frequency polygons as an alternative for representing distributions. Along with the code in lesson 3 and url:
I managed to combine frequency polygons and it became easy to view the variations in alcohol content.
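A minimal sketch of overlaying two frequency polygons is shown below, assuming the two data sets are stacked with a type column as described in the overview. The data frames and their values are made up, the bin width and axis breaks are arbitrary choices, and the ggplot2 part only runs if that package is installed.

```r
# Toy alcohol values for the two wine types (made up for illustration).
white <- data.frame(alcohol = c(9.2, 9.5, 9.5, 10.1, 10.5, 11.1, 11.4, 12.3),
                    type = "white")
red   <- data.frame(alcohol = c(9.4, 9.5, 9.8, 10.5, 10.9, 11.2),
                    type = "red")
wines <- rbind(white, red)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One polygon per wine type, distinguished by colour.
  p <- ggplot(aes(x = alcohol, colour = type), data = wines) +
    geom_freqpoly(binwidth = 0.5) +
    scale_x_continuous(breaks = seq(8, 14, 1)) +
    labs(x = "Alcohol (% by volume)", y = "Number of samples",
         colour = "Wine type")
  # Call print(p) to draw the plot.
}
```

Mapping type to colour is what puts both distributions on the same axes; binwidth and breaks are the two settings most worth adjusting by eye.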
Adding Thematic and Axis labels
The frequency polygons created in the univariate section were also used in the ‘Plots and Summary’ section. To enhance the graph in this section I had to search for how to add themes and titles to a plot in R. Adding the title and axis labels was easier by following the code given in lesson 3 in the frequency polygon section’s instructor notes. The next task was to add themes and format the axis labels; this was difficult, as I had to look for the appropriate theme and label code. Some time was spent on web searches to find the appropriate R code to implement this. I got a lot of help in this regard from the urls below:
Changing Binwidth and Scale breaks
Use of Box plots
Scatter plot, regression line and correlation coefficient for quality vs. alcohol in the bivariate section did show a positive relation between the two variables but did not reveal the variation in the trend. I recalled the R code covered for box plots in the course material of ‘Data Analysis with R’ and used it to plot alcohol vs. quality. At first I used the following piece of R code:
ggplot(aes(x = quality, y = alcohol), data = wqw)…..
This created one huge box plot spanning all quality rating bins. I was confused and surprised, as I thought this plot would be simple. After failing to figure out the exact cause of such an unusual plot, I recalled the as.ordered function I had used to convert characters to ordered factors. Using the following code, I was then able to create separate box plots for all quality rating buckets.
ggplot(aes(x = as.ordered(quality), y = alcohol), data = wqw)…..
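The single large box is what geom_boxplot() draws when x is numeric and no grouping is given, because a continuous x is not split into groups. The sketch below shows the effect of as.ordered() on a made-up data frame (toy): the base R boxplot() call computes the per-bucket statistics, and the ggplot2 version, which assumes ggplot2 3.3.0 or later for the fun argument of stat_summary(), only runs if the package is installed.

```r
# Toy data, made up for illustration.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 10.1, 10.2, 10.6, 11.0, 11.2, 11.8, 12.4)
)

# quality is stored as a number, so it is treated as a continuous axis.
print(is.numeric(toy$quality))

# as.ordered() turns each distinct rating into its own level.
quality_f <- as.ordered(toy$quality)
print(levels(quality_f))

# Base R: one column of box statistics per quality level, without drawing.
stats <- boxplot(alcohol ~ as.ordered(quality), data = toy, plot = FALSE)
print(stats$stats)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One box per rating, with the mean marked by a green cross.
  p <- ggplot(aes(x = as.ordered(quality), y = alcohol), data = toy) +
    geom_boxplot() +
    stat_summary(fun = mean, geom = "point", shape = 4, colour = "darkgreen")
  # Call print(p) to draw the plot.
}
```

Using factor(quality) or adding aes(group = quality) would split the boxes in the same way.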
Appropriate Axis labels, Colors and Legend
Setting colors and background in the plots was the trickiest. This was done by using the scale_colour_manual and scale_fill_manual functionalities and following the examples in the urls below:
The color scheme for the scatter plots in particular, diverging from red to green, was inspired by cellular energy level symbols that I normally use at work as a telecom engineer. The following url will give an idea:
Another source that was really helpful to choose the exact hex color code to be used in the R code for scale_colour_manual or scale_fill_manual was:
Placement of the legend tile on the plot was also tricky. Placing it in RStudio did not come out the same way when knitting the file. So it was done by trial and error to make sure the legend tile (specifically for quality in the 3rd final plot) fit appropriately on the plot. For this purpose a bit of experimentation was done on the two sub functions of theme, legend position and legend justification based on the examples in the url:
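A rough sketch of the colour and legend settings described above follows. The hex codes and the toy data frame are hypothetical choices, not the ones used in the report. The version check reflects the understanding that ggplot2 3.5.0 and later place a legend inside the panel through legend.position = "inside" together with legend.position.inside, while older releases accept a numeric legend.position; treat that detail as an assumption to verify against the installed version.

```r
# Hypothetical orange-to-green palette for quality ratings 5 to 7.
quality_colours <- c("5" = "#FF8C00", "6" = "#FFD700", "7" = "#228B22")

# Toy data, made up for illustration.
toy <- data.frame(
  quality   = c(5, 5, 6, 6, 7, 7),
  alcohol   = c(9.4, 9.9, 10.5, 10.9, 11.6, 12.2),
  sulphates = c(0.45, 0.52, 0.48, 0.60, 0.55, 0.66)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Anchor the legend's bottom-right corner to the panel's bottom-right corner.
  if (utils::packageVersion("ggplot2") >= "3.5.0") {
    legend_theme <- theme(legend.position = "inside",
                          legend.position.inside = c(1, 0),
                          legend.justification = c(1, 0))
  } else {
    legend_theme <- theme(legend.position = c(1, 0),
                          legend.justification = c(1, 0))
  }
  p <- ggplot(aes(x = alcohol, y = sulphates, colour = as.ordered(quality)),
              data = toy) +
    geom_point() +
    scale_colour_manual(values = quality_colours, name = "Quality") +
    legend_theme
  # Call print(p) to draw the plot.
}
```

Naming the colour vector by factor level keeps each rating tied to its colour even if a level is missing from a subset.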
Scale settings
Some of the variables in the data sets, white wines in particular, had a few outliers. This made it difficult to interpret the trend when plotting histograms and scatter plots, so in most cases I had to apply appropriate axis limits in order to view the trend. This was done mostly by trial and error, making sure that I did not cut off more than 15 rows of outliers in any of the plots. xlim and ylim came in very handy for some of the plots, such as the histogram for density, the scatter plot of density vs. residual sugar for white wine, the scatter plot for red wine (sulphates vs. alcohol) with quality and the scatter plot for white wine (residual sugar vs. alcohol) with quality, to name a few.
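The trial-and-error check on how many rows a pair of axis limits would cut off can be done before plotting. The sketch below uses a hypothetical helper, rows_outside(), and made-up residual sugar values; the threshold of 15 comes from the paragraph above.

```r
# Hypothetical helper: number of values falling outside the limits c(lower, upper).
rows_outside <- function(x, lim) sum(x < lim[1] | x > lim[2], na.rm = TRUE)

# Toy residual sugar values with a few large outliers (made up for illustration).
residual_sugar <- c(1.2, 1.6, 2.3, 5.4, 6.9, 8.1, 12.0, 20.7, 31.6, 65.8)

# Candidate limits to try for the x axis.
lim <- c(0, 25)
dropped <- rows_outside(residual_sugar, lim)
print(dropped)

# Keep the limits only if they remove at most 15 rows.
limits_ok <- dropped <= 15
print(limits_ok)
```

In ggplot2, xlim() and ylim() remove out-of-range rows before any statistics are computed, whereas coord_cartesian() only zooms the view, so the choice between them matters for box plots and regression lines.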

References